Engineering External Memory LCP Array Construction: Parallel, In-Place and Large Alphabet
Authors
Abstract
The suffix array augmented with the LCP array is perhaps the most important data structure in modern string processing. There has been a lot of recent research activity on constructing these arrays in external memory. In this paper, we engineer the two fastest LCP array construction algorithms (ESA 2016) and improve them in three ways. First, we speed up the algorithms by up to a factor of two through parallelism: just 8 threads are sufficient to make the algorithms essentially I/O bound. Second, we reduce the disk space usage of the algorithms by making them in-place: the input (text and suffix array) is treated as read-only, and the working disk space never exceeds the size of the final output (the LCP array). Third, we add support for large alphabets; all previous implementations assume the byte alphabet.
1998 ACM Subject Classification: E.1 Data Structures, F.2.2 Nonnumerical Algorithms and Problems
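To make the objects in the abstract concrete, the following is a minimal internal-memory sketch of computing an LCP array from a text and its suffix array. It uses the well-known linear-time approach of Kasai et al., not the external-memory algorithms engineered in the paper. The function names `naive_suffix_array` and `lcp_from_suffix_array` are hypothetical, chosen only for illustration, and the convention that `lcp[0] = 0` is an assumption. Because the code only compares symbols for equality and order, the text may be a string or a list of integers, which loosely mirrors the large-alphabet setting.

```python
def naive_suffix_array(text):
    """Illustrative helper: sort suffix start positions by comparing suffixes.

    Far too slow for large inputs; it only provides input for the sketch.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_from_suffix_array(text, sa):
    """Return lcp, where lcp[r] is the length of the longest common prefix
    of the suffixes starting at sa[r - 1] and sa[r]; lcp[0] is set to 0.
    """
    n = len(text)

    # rank[i] is the position of suffix i in the suffix array.
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r

    lcp = [0] * n
    h = 0  # number of symbols already known to match
    for i in range(n):  # visit suffixes in text order
        r = rank[i]
        if r == 0:
            h = 0  # the smallest suffix has no predecessor
            continue
        j = sa[r - 1]  # lexicographic predecessor of suffix i
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[r] = h
        # Dropping the first symbol of suffix i loses at most one match.
        if h > 0:
            h -= 1
    return lcp


if __name__ == "__main__":
    text = "banana"
    sa = naive_suffix_array(text)
    print(sa, lcp_from_suffix_array(text, sa))
```

Only the text and the suffix array are read here, and the LCP array is the only array kept as output, but the `rank` array is extra working space that an in-place external-memory method would have to avoid.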
Similar resources
Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array
The longest common prefix (LCP) array is a versatile auxiliary data structure in indexed string matching. It can be used to speed up searching using the suffix array (SA) and provides an implicit representation of the topology of an underlying suffix tree. The LCP array of a string of length n can be represented as an array of length n words, or, in the presence of the SA, as a bit vector of 2n...
Optimal Substring-Equality Queries with Applications to Sparse Text Indexing
We consider the problem of encoding a string of length n from an alphabet [0, σ − 1] so that access and substring-equality queries (that is, determining the equality of any two substrings) can be answered efficiently. A clear lower bound on the size of any prefix-free encoding of this kind is n log σ + Θ(log(nσ)) bits. We describe a new encoding matching this lower bound when σ ≤ n^{O(1)} while su...
Faster External Memory LCP Array Construction
The suffix array, perhaps the most important data structure in modern string processing, needs to be augmented with the longest-common-prefix (LCP) array in many applications. Their construction is often a major bottleneck especially when the data is too big for internal memory. We describe two new algorithms for computing the LCP array from the suffix array in external memory. Experiments demo...
Space-Time Tradeoffs for Longest-Common-Prefix Array Computation
The suffix array, a space efficient alternative to the suffix tree, is an important data structure for string processing, enabling efficient and often optimal algorithms for pattern matching, data compression, repeat finding and many problems arising in computational biology. An essential augmentation to the suffix array for many of these tasks is the Longest Common Prefix (LCP) array. In parti...
Computing the BWT and LCP array of a Set of Strings in External Memory
Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multi-string generalization of the Burrows-Wheeler Transform (BWT): recent developments in this field have resulted in external memory algorithms, motivated by the large requirements of in-